1 Introduction

We take part in the opinion and polarity retrieval tasks of the blog track.

A test collection, called Blog06, was created for the blog track in 2006 [4].
It has three main components: feeds, permalinks and home pages. The collection
contains spam, and possibly non-blog and non-English pages as well. For our
experiments only permalinks were taken into consideration: 3.2 million Web
pages, for a total of 88.8 GB, each containing a post and its related comments.

The evaluation metrics are precision/recall based [4]: Mean Average Precision
(MAP) and R-Precision (R-Prec). We also focused on Precision at 10 (P@10), due
to its relevance in evaluating the effectiveness of Web search engines [5] [3].
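
As a point of reference, the three measures can be sketched in a few lines of Python. This is a minimal illustration under the usual binary-relevance definitions, not the official evaluation tool; the function names and the way rankings are represented (lists of document identifiers) are assumptions made for the example. MAP is the mean over topics of the per-topic average precision.

```python
def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k


def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for d in ranking[:r] if d in relevant) / r


def average_precision(ranking, relevant):
    """Mean of the precision values at each rank holding a relevant document."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranking, relevant) pairs, one pair per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

P@10 only looks at the first result page, which is why it is often preferred when judging Web search engines, whereas MAP and R-Prec depend on the whole ranking.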

As in 2007, we based our approach on the construction of ad-hoc weighted
dictionaries containing terms assumed to be used to express a sentiment. The
weight of a term measures how much sentiment the term expresses.

To construct our dictionaries automatically, we assumed that "opinion-bearing"
words distribute more randomly in the set of opinionated documents than
semantic-bearing terms, but less randomly than non-informative terms.

As a consequence, we relied on two theoretical measures. The first was based on
a Divergence From Randomness (DFR) model and defined the weight of each term
within an opinionated document, thereby identifying the set of candidate terms
for the vocabularies. The second was based on entropy maximization over the set
of all relevant and opinionated documents, and determined the final content of
the dictionaries and the weights of their terms.

Using these dictionaries, we first reranked the set of documents relevant to a
topic on the basis of the quantity of opinion they express, and then extracted
two new rankings according to the polarity of the expressed sentiment.

These phases are described in detail in Sections 2, 3, 4, 5 and 6. Section 7
reports and discusses the experiments and their results. Finally, a brief
analysis of our results is given in Section 8.
2 Data preprocessing


As in 2007 [5], data preprocessing mainly consisted of trying to remove
non-English documents from the collection with LingPipe [1]. We also expected
the tool to detect some of the spam. A deeper analysis of the effectiveness of
this approach than the one conducted in 2007 revealed that a considerable
fraction of relevant documents were wrongly identified as written in a language
other than English.

Unfortunately, we did not have enough time to test alternative ways of training
LingPipe, or to evaluate completely different approaches to the problem. We
were thus forced to work with the original collection, spam and non-English
documents included.

3 Topic relevance retrieval

For the retrieval of the documents relevant to a topic, we basically followed
the same approach adopted in 2007, with two exceptions: we did not rely on the
distributed implementation of Terrier [2] to build our indexes, and we adopted
DFReel, a parameter-free retrieval model, instead of DPH. The stemming settings
and the choice of the parametric PL2 model, with c set to 9, were unchanged.

Table 1 shows the values of MAP, R-Prec and P@10 for the topic relevance
retrieval baselines we submitted to TREC 2008 (BL_DFRee and BL_PL2c9, for the
DFReel and PL2 retrieval models respectively), together with the same values
for the baselines provided by NIST (BL1, BL2, BL3, BL4, BL5).

4 Automatic construction of ad-hoc dictionaries

Our approach is based on the construction of three ad-hoc weighted
dictionaries: one for opinion retrieval, OpinD, and two for polarity detection,
PosD and NegD. Before going into the details of the construction, let us
introduce some notation. Let:

- C denote the collection of documents;

- $R \subseteq C$ denote the set of documents relevant to a topic;

- $O \subseteq R$ denote the set of documents relevant to the same topic and
expressing an opinion on it.

In automatically constructing OpinD [3] [5], we assumed that:

- content-bearing words maximize the probability P of observing the posterior
probability of occurrence in O, given the prior probability of occurrence in R;

- opinion-bearing words, instead, minimize the same probability. The weight of
an opinion-bearing word is provided by a DFR model and, as a consequence, a
word is assumed to express an opinion iff it maximizes the value of the
computed divergence;

- the best opinion-bearing words also maximize the entropy in O. Our approach
to maximizing entropy is to select the terms with the highest divergence that,
at the same time, occur in a large enough number of opinionated documents.

Starting from these assumptions, R has been identified with the set of
documents judged relevant in TREC 2006-2007, i.e. those labeled 1, 2, 3 or 4 in
the provided qrels, and O with the set of opinionated ones, i.e. those labeled
2, 3 or 4 in the same qrels [4].

The DFReel DFR model has been adopted to identify the set of opinion-bearing
words. The set of best opinion-bearing words is then obtained as follows: a
sequence of candidate dictionaries $D_1 \supset D_2 \supset \dots \supset D_k$,
with $D_1$ coinciding with the set of opinion-bearing words, has been computed
such that

$$\forall i = 1, \dots, k: \quad D_i = \{\, t \in D_1 \wedge df_t \geq i \,\}$$

where $df_t$ is the document frequency of term t in O [3] [5].

As a result, a generic level-k dictionary contains all the opinion-bearing
terms occurring in at least k documents in O. Our final goal was to find the
maximum value of k, say $\hat{k}$, that keeps the retrieval performance stable
when compared with that obtained with $D_k$, $k < \hat{k}$, and keeps the
dictionary small enough to be computationally efficient. The value of $\hat{k}$
best fitting our needs has been tentatively fixed to 1,000.
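
The thresholding that produces the nested candidate dictionaries can be illustrated as follows. This is only a sketch under stated assumptions: `weights` stands for the DFR divergence weights of the opinion-bearing terms (computed elsewhere and not reproduced here), documents are assumed to be already tokenized, and the function names are hypothetical.

```python
from collections import Counter


def document_frequencies(opinionated_docs):
    """Count, for each term, the number of documents of O that contain it.

    opinionated_docs: iterable of token lists, one list per document.
    """
    df = Counter()
    for tokens in opinionated_docs:
        df.update(set(tokens))  # each document counts at most once per term
    return df


def level_dictionary(weights, df, k):
    """Level-k dictionary: opinion-bearing terms occurring in at least k documents.

    weights: dict term -> weight for the opinion-bearing terms (assumed given).
    """
    return {t: w for t, w in weights.items() if df.get(t, 0) >= k}
```

Raising k can only remove terms, so each dictionary is contained in the previous one; the choice of the final threshold then trades dictionary size against retrieval performance.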

PosD and NegD are determined analogously: all the above assumptions and
considerations still hold if R is replaced by O, and O by $O^+$ and $O^-$,
respectively, where $O^+$ (resp., $O^-$) denotes the set of documents
expressing a positive (resp., negative) opinion.

This time O has been identified with the set of documents judged positively or
negatively opinionated in TREC 2006-2007, i.e. those labeled 2 or 4 in the
provided qrels. The value of k best fitting our needs has been tentatively
fixed to 500 for PosD and to 100 for NegD. Since the weights assigned to terms
appeared to be significantly dissimilar between the two dictionaries, the
weights of each dictionary have been normalized to the highest value within
that dictionary.
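
The normalization step amounts to dividing every weight by the largest weight of the same dictionary, so that both dictionaries end up on a common scale with maximum 1. A minimal sketch, assuming dictionaries are stored as term-to-weight mappings with positive weights (the function name is hypothetical):

```python
def normalize_to_max(weights):
    """Rescale the weights of one dictionary so that its highest weight is 1.0."""
    if not weights:
        return {}
    top = max(weights.values())
    if top <= 0:
        raise ValueError("expected at least one positive weight")
    return {term: w / top for term, w in weights.items()}
```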

5 Opinionated relevance retrieval

For each query q, opinionated and relevant documents were ranked in three
steps:

1. a topic retrieval step was carried out as described in Section 3: a rank,
say $\mathrm{content\_rank}(d \| q)$, was assigned to each document d,
depending on the score, say $\mathrm{content\_score}(d \| q)$, assigned to it
by the adopted DFR model;

2. a new query, made of all the terms in OpinD weighted by their respective
weights, was submitted, and a new score, say
$\mathrm{opinion\_score}(d \| OpinD)$, was obtained for each document d. A new
rank, say $\mathrm{opinion\_rank}(d \| q)$, was then obtained for each document
d on the basis of

$$\mathrm{opinion\_score}(d \| q) = \frac{\mathrm{opinion\_score}(d \| OpinD)}{\mathrm{content\_rank}(d \| q)};$$

3. the final ranking was obtained by further boosting the score assigned to
each document d, say $\mathrm{content\_score}^{+}(d \| q)$, as follows:

$$\mathrm{content\_score}^{+}(d \| q) = \frac{\mathrm{content\_score}(d \| q)}{\mathrm{opinion\_rank}(d \| q)}.$$
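
The three steps above can be sketched as follows. This is an illustration under stated assumptions rather than the system actually run: the two input score dictionaries are taken as given (in the experiments they come from the retrieval models), ranks start at 1 for the best document, and ties are broken by document identifier, a choice the text does not specify.

```python
def ranks(scores):
    """Map each document to its rank (1 = highest score); ties broken by id."""
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return {d: i for i, d in enumerate(ordered, start=1)}


def opinion_rerank(content_score, opinion_dict_score):
    """Rerank the documents retrieved for one query.

    content_score:      dict doc -> content_score(d||q), from topic retrieval
    opinion_dict_score: dict doc -> opinion_score(d||OpinD), from the
                        dictionary query (missing documents count as 0.0)
    Returns the final ranking and the boosted scores content_score+(d||q).
    """
    # Step 1: rank induced by the topic retrieval scores.
    content_rank = ranks(content_score)
    # Step 2: dictionary score divided by the content rank, then ranked.
    opinion_score = {
        d: opinion_dict_score.get(d, 0.0) / content_rank[d] for d in content_score
    }
    opinion_rank = ranks(opinion_score)
    # Step 3: content score divided by the opinion rank.
    boosted = {d: content_score[d] / opinion_rank[d] for d in content_score}
    ranking = sorted(boosted, key=lambda d: (-boosted[d], d))
    return ranking, boosted
```

Dividing by a rank rather than by a raw score keeps the two evidence sources on comparable scales: a document keeps its full content score only if it is also the top document by opinion.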

6 Polarity Recognition

Polarity recognition follows an approach similar to that adopted for
opinionated relevance retrieval. The starting point is the opinion ranking
obtained as described in Section 5. The computation of the polarity rank is
based on the weights assigned to the terms in PosD and in NegD. The final
polarity score of a document is obtained by subtracting its negative polarity
score from its positive one. If the final score is greater than zero, the
document is considered to express a positive opinion; otherwise, a negative
one. Finally, if the score is close to zero, we consider the document not
sufficiently polarized.
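
The decision rule can be illustrated with a short sketch. Two points are assumptions made only for the example: the per-dictionary score is computed here as a plain sum of the weights of the matching tokens, standing in for the retrieval-model score actually used, and `epsilon`, the threshold for "close to zero", is a hypothetical parameter whose value is not given in the text.

```python
def dictionary_score(tokens, dictionary):
    """Sum of the dictionary weights of the tokens (stand-in for the real score)."""
    return sum(dictionary.get(t, 0.0) for t in tokens)


def polarity(tokens, pos_dict, neg_dict, epsilon=0.1):
    """Classify a document from the difference of its two polarity scores.

    epsilon is a hypothetical threshold below which the document is treated
    as not sufficiently polarized.
    """
    score = dictionary_score(tokens, pos_dict) - dictionary_score(tokens, neg_dict)
    if abs(score) <= epsilon:
        return "neutral", score
    return ("positive" if score > 0 else "negative"), score
```

With this rule the positive and negative rankings are disjoint by construction, since each document receives exactly one label.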

7 Tests and results

We first generate our baselines, one for each topic of interest, by ranking the
documents according to the content they bear. Table 1 shows the mean values of
the topic relevance MAP, R-Prec and P@10 for our baselines (rows BL_DFRee and
BL_PL2c9, for the DFReel and PL2 retrieval models respectively), together with
the same values for the baselines provided by NIST. As the table shows, in no
case did we succeed in improving on the reference baselines.

Next, each baseline is re-ranked according to the quantity of opinion its
documents bear. These new rankings will be referred to as opinion-based
rankings. To assess the effectiveness of our approach, we first investigate its
impact on the baselines: we compared the MAP and R-Prec values of Table 1 with
the corresponding values for the opinion-based rankings generated using the
DFReel and PL2 models, shown in Tables 2 and 3, respectively. These tables also
show the results of the comparison.

We then compared the MAP and R-Prec values for the opinion relevance of the
baselines, shown in Table 4, with the corresponding values for the
opinion-based rankings generated using the DFReel and PL2 models, shown in
Tables 5 and 6, respectively. Table 7 shows the results of this comparison.

Furthermore, for each topic we were given the medians of the opinion relevance
MAP and R-Prec over all the runs submitted by all the participants. We compute
the means of these values, 0.3050 for MAP and 0.3651 for R-Prec, and compare
them with our results, as shown in Tables 5 and 6.

Finally, the documents in each of the opinion-based rankings are filtered
according to the positive (resp., negative) polarity of the opinion they bear,
resulting in two new rankings, one of documents expressing a positive opinion,
the other of documents expressing a negative one. It is worth noting that the
sets of documents in these two rankings do not intersect, and that each is a
subset of the documents appearing in the original opinion-based ranking. This
implies that the MAP and R-Prec values for these rankings cannot be directly
compared with those for the opinion-based rankings.

As a consequence, we limited ourselves to investigating the effectiveness of
our approach by comparing the medians of the polarity relevance MAP (0.1151)
and R-Prec (0.1624) for the runs submitted by all participants with our
results, as shown in Tables 8 and 9.

8 Conclusion

Our experiments confirmed that our approach to opinion retrieval is effective
and robust, even in the absence of ad hoc solutions for detecting spam and
non-English documents. The DFReel model proved to outperform PL2. As for
polarity detection, we failed to achieve acceptable results. We think that the
main reason for this failure is the limited effectiveness of our sentiment
dictionaries in properly classifying documents. An approach based on passage
retrieval may be a suitable solution to this problem.
BL2    4.2% / 1.2%        3.0% / -0.8%       3.3% / 3.4%
BL3    6.8% / 6.9%        6.1% / 5.1%        5.7% / 6.6%
BL4    5.8% / 4.9%        4.7% / 4.3%        5.3% / 6.5%
BL5    -80.7% / -80.5%    -90.3% / -89.4%    -92.3% / -91.9%

Table 7.


Run           MAP      R-Prec   Δ% MAP   Δ% R-Prec
FIUpDFRDFR    0.0569   0.1058   -40%     -22%
FIUpPL2DFR    0.0561   0.1076   -51%     -34%
FIUpBL1DFR    0.0686   0.1269   -41%     -21%
FIUpBL2DFR    0.0560   0.1074   -31%     -13%
FIUpBL3DFR    0.0681   0.1277   -86%     -82%
FIUpBL4DFR    0.0793   0.1406   -51%     -35%
FIUpBL5DFR    0.0158   0.0285   -51%     -34%

Table 8.


Run           MAP      R-Prec   Δ% MAP   Δ% R-Prec
FIUpDFRDFR    0.0484   0.0821   -28%     -20%
FIUpPL2DFR    0.0481   0.0802   -24%     -17%
FIUpBL1DFR    0.0481   0.0801   14%      12%
FIUpBL2DFR    0.0507   0.0831   -4%      -2%
FIUpBL3DFR    0.0760   0.1124   -70%     -76%
FIUpBL4DFR    0.0640   0.0981   -27%     -18%
FIUpBL5DFR    0.0198   0.0238   -28%     -20%

Table 9.


